---
title: Extreme Nonviable Candidates and Quota Maneuvering in Brazilian Legislative Elections
author:
- "Kristin Wylie"
- "Pedro Santos"
- "Daniel Marcelino"
date: " April 9, 2019"
output: html_notebook
# output: 
#  html_document: 
#    number_sections: yes
#    toc: yes
#    toc_depth: 2
---

```{r setup, include = FALSE, message = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = FALSE,
  warning = FALSE,
  comment = "",
  echo = TRUE,
  cache = FALSE,
  fig.align = "center",
  fig.width = 6,
  fig.height = 4.5,
  fig.retina = 2,
  out.width = "100%"
)
```

# Introduction

This file is intended for a reader to reproduce our analysis almost from scratch. Thus, the reader will figure out this document is quite extensive, with codes not only to performing the analysis, but also to generate and setup some variables. Keep all of the data and code (this markdown file) within the same folder.

The document is structured in 5 parts. In the first part, it shows how to install necesseray packages, to load the dataset, and to compute some basic descriptive statistics. In the second, it reports the way we've generated important variables in the analyses. In the third part, it reports some descriptive statistics that we show in the paper, and a little more. Next, it shows the estimation procedure for the ANOVA and post-hoc tests. In the last part, it demonstrates how we've estimated the logistic multilevel model with mixed effects, including some extra functions for simulation and resampling results.

The dataset we provide includes data from 1994 to 2014 elections. However, because some information for the 1994 period is not as consistent as the forward elections. Our reviewers asked to consider only the election cycles between 1998 to 2014. This file is intended to analyze the complete time range, however, if the reader wants to restrict the analysis for the period from 1998 to 2014, it's quite simple: just uncomment  the line below after you load the dataset `# bd <- subset(bd, year > 1 )`.    

# Reproducible setup 

## Packages you'll need in order to perform this reproducible file 
- The following commands will install these packages if they are not already installed.
```{r}

pks = c('dplyr','tidyr','ggplot2', 'lme4', 'compiler', 'parallel', 'boot', 'ggeffects')

for(p in pks){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}
```


## Loading data 
- It is important to have the dataset file in the same folder with this file, otherwise you should make changes in the following chunk to address the dataset.

```{r}
bd <- readRDS("DB1994-2014_04.19.rda") 

# bd <- subset(bd, year > 1 )
```

## Description of the dataset 
##### What is total number of observations in the dataset? {-}
```{r}
bd %>%
  summarise(n=n())
```

##### What is the total number of unique observations in the dataset? {-}
```{r}
bd %>%
  distinct(nr_identifying) %>%
  summarise(n=n())
```

##### What is total number of observations by election year? {-}
```{r}
bd %>%
  count(year)
```

##### What is the total number of observations by election and states units? {-}
```{r}
bd%>%
  count(year, state) %>%
  spread(year,n) %>%
  data.frame() %>%
  rename("1994"= X1,"1998" = X2,"2002" = X3,"2006" = X4,"2010" = X5,"2014" = X6)
```

# Data preparation

#### This calculates the quociente minimo (fewest votes won by elected candidate per state) {-}
```{r}
bd <- bd %>%
  group_by(year, state, eleito) %>%
  mutate(quoc_min =  min(qtd_votos, na.rm=TRUE)) %>%
  ungroup()
```

#### The following adds quoc_min by state to unelected candidates' observations, and merges the two variables {-}
```{r}
bd <- bd %>%
  group_by(state, year) %>%
  mutate(quoc_min =  max(quoc_min, na.rm=TRUE)) %>%
  ungroup()
```

#### The following calculates 1% quociente minimo (threshold for laranjas) {-}
- We've selected this strategy to identify non-viable candidates 
```{r}
bd <- bd %>%
  mutate(onepctquoc_min = quoc_min *.01,
  laranja = ifelse(qtd_votos <= onepctquoc_min, 1, 0))
```

#### How many non-viable candidates we're able to identify using 1% quociente minimo threshold for identifying laranjas? {-}
```{r}
# Absolute numbers
bd %>%
  count(laranja, year) %>%
  # mutate(freq = 100*(n/sum(n))) %>%
  # select(year, laranja, freq) %>%
  spread(laranja, n) %>%
  rename("No"=`0`,"Yes"=`1`)


# Relative frequencies
bd %>%
  group_by(year, laranja) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja, freq) %>%
  spread(laranja, freq) %>% 
  rename("No"=`0`,"Yes"=`1`)
```



#### The following calculates campaign finance total by state {-}
- We also tried to include campaign finance in the estimation of laranjas, but the results didn't help much 
```{r}
bd <- bd %>%
  group_by(state, year) %>%
  mutate(uf_cf =  sum(cf, na.rm=TRUE)) %>%
  ungroup()
```

#### The following calculates each candidate's share of all candidate contributions in her state {-}
```{r}
bd <- bd %>%
  mutate(cfshare = (cf/uf_cf)*100)
```

#### The following calculates `$$` minimo (fewest share of `$$` won by elected candidate per state) {-}
```{r}
bd <- bd %>%
  group_by(year, state, eleito) %>%
  mutate(uf_cfmin =  min(cfshare, na.rm=TRUE)) %>%
  ungroup()

```


#### This adds `$$` minimo by state to unelected candidates' observations, merges the two vars {-}

```{r}
bd <- bd %>%
  group_by(year, state) %>%
mutate(uf_cfmin =  max(uf_cfmin, na.rm=TRUE)) %>%
ungroup()
```


#### The following calculates 10% `$$` minimo (threshold for laranjas?) {-}
```{r}
bd <- bd %>%
  mutate(tenpctcfmin =  (uf_cfmin * .1),
         laranja_cf = ifelse(cfshare <= tenpctcfmin,1,0) )
```

#### How does 10% `$$` minimo threshold perform? {-}
```{r}
bd %>%
  group_by(year, laranja_cf) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja_cf, freq) %>%
  spread(laranja_cf, freq) %>% 
  rename("No"=`0`,"Yes"=`1`)
```


#### The following calculates 1% `$$` minimo (threshold for laranjas?) {-}
```{r}
bd <- bd %>%
  mutate(onepctcfmin =  (uf_cfmin *.01 ),
         laranja_cf_one = ifelse(cfshare <= onepctcfmin,1,0) )
```

#### How does 1% `$$` minimo threshold perform? {-}
```{r}
bd %>%
  group_by(year, laranja_cf_one) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja_cf_one, freq) %>%
  spread(laranja_cf_one, freq) %>% 
  rename("No"=`0`,"Yes"=`1`)
```


#### The following calculates an alternative `$$` minimo (least `$$` won by elected candidate per state) {-}
```{r}
bd <- bd %>%
  group_by(year, state, eleito) %>%
  mutate(uf_cfmin3 =  min(cf, na.rm=TRUE)) %>%
  ungroup()
```


#### This adds `$$` minimo by state to unelected candidates' observations, merges the two vars {-}
```{r}
bd <- bd %>%
  group_by(year, state) %>%
  mutate(uf_cfmin3 =  max(uf_cfmin3, na.rm=TRUE)) %>%
  ungroup()
```

#### The following calculates 10% `$$` minimo (threshold for laranjas?) {-}
```{r}
bd <- bd %>%
  mutate(tenpctcfmin3 =  (uf_cfmin3 * .1),
         laranja_cf3 = ifelse(cf <= tenpctcfmin3,1,0) )
```


#### How does 10% `$$` minimo threshold perform? {-}
```{r}
bd %>%
  group_by(year, laranja_cf3) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja_cf3, freq) %>%
  spread(laranja_cf3, freq) %>% 
  rename("No"=`0`,"Yes"=`1`)
```


#### The following calculates 1% `$$` minimo (threshold for laranjas?) {-}
```{r}
bd <- bd %>%
  mutate(onepctcfmin3 =  (uf_cfmin3 * .01),
         laranja_cf_one3 = ifelse(cf <= onepctcfmin3,1,0) )
```

#### How does 1% `$$` minimo threshold perform? {-}
```{r}
bd %>%
  group_by(year, laranja_cf_one3) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja_cf_one3, freq) %>%
  spread(laranja_cf_one3, freq) %>% 
  rename("No"=`0`,"Yes"=`1`)
```


#### This is what we ended up using for cf threshold {-}
```{r}
bd <- bd %>%
  group_by(year, state, eleito) %>%
  mutate(cfwinavg =  mean(cf, na.rm=TRUE)) %>%
  ungroup()
```


```{r}
bd <- bd %>%
  group_by(year, state) %>%
  mutate(cfwinavg = ifelse(eleito==1, cfwinavg, max(cfwinavg , na.rm=TRUE))) %>%
  ungroup()
```


#### The following calculates 0.05%  `$$` minimo (threshold for laranjas?) {-}
```{r}
bd <- bd %>%
  mutate(cf_quoc_min = cfwinavg * .005,
         laranja_cfavg = ifelse(cf <= cf_quoc_min,1,0))
```

#### How does 0.05% `$$` minimo threshold perform? {-}
```{r}
bd %>%
  group_by(year, laranja_cfavg) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja_cfavg, freq) %>%
  spread(laranja_cfavg, freq) %>% 
  rename("No"=`0`,"Yes"=`1`)
```

#### The following genrates `hdilow`, `hdimed`, `hdihigh`, and `hdigroup` {-}

```{r, eval=FALSE}
# hdigroup - ordinal variable based on 1994-2010 hdi_uf (PNUD & banco central)

bd <- bd %>%
  mutate(hdilow = ifelse(state %in% c("AM","AP","DF","ES","GO","MG","MS","MT","PR","RJ","RO","RR","RS","SC","SP"), 0, 1))

bd <- bd %>%
  mutate(hdimed = ifelse(state %in% c("AM","AP","ES","GO","MG","MS","MT","RO","RR"), 1,0))

bd <- bd %>%
  mutate(hdihigh = ifelse(state %in% c("DF","PR","RJ","RS","SC","SP"), 1,0))


bd <- bd %>%
  mutate(hdigroup = ifelse(hdimed ==1, 1, ifelse(hdihigh == 1, 2, 0)))
```


# Descriptive statistics and tables 

##### Vote share laranja vs non-laranja {-}
```{r}
bd %>%
  group_by(laranja) %>%
  summarise(
    count = n(),
    mean = mean(vshare, na.rm = T),
    sd = sd(vshare, na.rm = T) )
```


##### Vote share femiale vs male {-}
```{r}
bd %>%
  group_by(feminino) %>%
  summarise(
    count = n(),
    mean = mean(vshare, na.rm = T),
    sd = sd(vshare, na.rm = T) )
```


## Table A1 Quociente mínimo, by state (1994–2014)

```{r}
bd %>%
  group_by(year, state) %>%
  summarise(Q = mean(quoc_min)) %>%
  spread(year, Q) %>% data.frame() %>%
  rename(`1994`= X1,`1998`=X2,`2002`= X3,`2006`=X4,`2010`=X5,`2014`=X6)
```


## Table 2 Background of Chamber of Deputies candidates (1994–2014)

#### All candidates {-}
```{r}
# Age 
bd %>%
  group_by(year, feminino) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T)))

bd %>%
  group_by(year) %>%
  summarise(Average = mean(idade_na_data_eleicao, na.rm = T))
```

##### Color/race {-}
```{r}


bd <-  bd %>% 
  group_by(year) %>%
  mutate(afrodesc = ifelse(cor=="preta", 1, 
                     ifelse(cor=="parda", 1,0))) %>%
                   ungroup()

bd %>%
  # filter(year >= 2014) %>% # only 2014 and after
  group_by(year, afrodesc) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, afrodesc, freq) %>%
  spread(afrodesc, freq) %>%
  rename("nonblack"=`0`,"black"=`1`)


bd <-  bd %>% 
  group_by(year) %>%
  mutate(white = ifelse(cor=="branca", 1,0)) %>%
                   ungroup()

bd %>%
  # filter(year >= 2014) %>% # only 2014 and after
  group_by(year, white) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, white, freq) %>%
  spread(white, freq) %>%
  rename("nonwhite"=`0`,"white"=`1`)
```

##### Education {-}
```{r} 
# In the descriptive stats table, we show a more disaggregated version  

bd %>%
  group_by(year, educ) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, educ, freq) %>%
  spread(educ, freq) %>%
  rename("primary"=`1`,"high school"=`2`, "college"=`3`)
```

##### Occupation {-}
```{r}

# In the descriptive stats table, we show a more disaggregated version 

bd %>%
  group_by(year, fd_ocup) %>%
  summarise(N=n()) %>%
  mutate(freq= 100 * (N/sum(N, na.rm = T))) %>%
  select(year, fd_ocup, freq) %>%
  spread(fd_ocup, freq) %>%
 rename("No"=`0`,"Yes"=`1`)
```


  
## Table 3 Laranjas on the rise (1994–2014)

#### All candidates {-}
```{r}
bd %>%
  group_by(year, laranja) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T)))

bd %>%
  group_by(year, laranja) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja, freq) %>%
  spread(laranja, freq) %>%
   rename("No"=`0`,"Yes"=`1`)
```


#### Only women candidates {-}
```{r}
bd %>%
  group_by(year, laranja) %>%
  filter(feminino==1) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T)))


bd %>%
  group_by(year, laranja) %>%
  filter(feminino==1) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja, freq) %>%
  spread(laranja, freq) %>%
  rename("No"=`0`,"Yes"=`1`)
```


#### Only male candidates {-}
```{r}
bd %>%
  group_by(year, laranja) %>%
  filter(feminino==0) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T)))


bd %>%
  group_by(year, laranja) %>%
  filter(feminino==0) %>%
  summarise(N=n()) %>%
  mutate(freq= 100*(N/sum(N, na.rm = T))) %>%
  select(year, laranja, freq) %>%
  spread(laranja, freq) %>%
  rename("No"=`0`,"Yes"=`1`)
```



## Table 5 Individual, party, and district-level characteristics of laranjas (1998–2014)

#### Two independent samples t-test with equal variances for laranjas and nonlaranjas {-}


##### Laranja vs nonlaranjas for district magnitude and female {-}
```{r}
t.test(dmag ~ as.factor(laranja),
       alternative = "two.sided",
       data = subset(bd,
       feminino == 1),
       na.action = na.omit)
```

##### Laranja vs nonlaranjas for district magnitude and male {-}
```{r}
t.test(dmag ~ as.factor(laranja),
       alternative = "two.sided",
       data = subset(bd,
       feminino == 0),
       na.action = na.omit)
```

#####  Laranja vs nonlaranjas for district magnitude overall {-}
```{r}
t.test(dmag ~ as.factor(laranja),
       alternative = "two.sided",
       data =bd,
       na.action = na.omit)
```


##### Laranja vs nonlaranjas for party magnitude and female {-}
```{r}
t.test(pmag ~ laranja,
       alternative = "two.sided",
       data = subset(bd,
       feminino == 1 &
       year %in% c(2, 3, 4, 5, 6)),
       na.action = na.omit)
```


## Table 6 Proportion of total candidates and women candidates qualifying as laranjas, major parties, by ideology (1998–2014)

- This prints the complete results presented in Table A3 in the appendix.


#### Non-left ideology parties (1998-2014) {-}
##### % of total candidates that classify as laranjas {-}
```{r}
bd %>%
  filter(year>1) %>%
  group_by(left, sgl_partido, laranja) %>%
  summarise(N = n()) %>%
  mutate(freq = round(100*(N/sum(N)),1)) %>%
 filter(left==0, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()

## Total 
bd %>%
  group_by(left, laranja,year) %>%
  summarise(N = n()) %>%
  mutate(freq = round(100*(N/sum(N)),1)) %>%
  filter(left==0, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()
```


#### Non-left ideology parties (2014) {-}
##### % of women candidates that classify as laranjas {-}
```{r}
bd %>%
  filter(year==6, feminino == 1) %>%
  group_by(left, sgl_partido, laranja) %>%
  summarise(N = n()) %>%
  mutate(freq = round(100*(N/sum(N)),1)) %>%
  filter(left==0, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()


## Total 
bd %>%
  filter(year==6, feminino == 1) %>%
  group_by(left, laranja) %>%
  summarise(N = n()) %>%
   mutate(freq = round(100*(N/sum(N)),1)) %>%
  filter(left==0, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()
```


#### Left ideology parties (1998-2014) {-}
##### % of total candidates that classify as laranjas {-}
```{r}
bd %>%
  group_by(left, sgl_partido, laranja) %>%
  summarise(N = n()) %>%
   mutate(freq = round(100*(N/sum(N)),1)) %>%
  filter(left==1, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()

## Total 
bd %>%
  group_by(left, laranja) %>%
  summarise(N = n()) %>%
   mutate(freq = round(100*(N/sum(N)),1)) %>%
   filter(left==1, laranja==1) %>%
    arrange(desc(freq)) %>% data.frame()
```


#### Left ideology parties (2014) {-}
##### % of women candidates that classify as laranjas {-}
```{r}
bd %>%
  filter(year==6, feminino == 1) %>%
  group_by(left, sgl_partido, laranja) %>%
  summarise(N = n()) %>%
   mutate(freq = round(100*(N/sum(N)),1)) %>%
  filter(left==1, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()

bd %>%
  filter(year==6, feminino == 1) %>%
  group_by(left, laranja) %>%
  summarise(N = n()) %>%
  mutate(freq = round(100*(N/sum(N)),1)) %>%
  filter(left==1, laranja==1) %>%
  arrange(desc(freq)) %>% data.frame()
```


# Analysis of variance and post-hoc tests 

```{r}
# laranja
bd %>%
  group_by(year, laranja) %>%
  summarise(
  count = n(),
  mean = mean(vshare, na.rm = TRUE),
   sd = sd(vshare, na.rm = TRUE)
  )

# feminino
bd %>%
  group_by(year, feminino) %>%
  summarise(
  count = n(),
  mean = mean(vshare, na.rm = TRUE),
  sd = sd(vshare, na.rm = TRUE)
  )

```

## Table 4a ANOVA results of candidate vote share (1994–2014)

##### The following tests if the difference in the average of vote share is equal zero between male and female candidates overall {-}
- Notice that Anova assumptions may be violeted because the heavily skewed distribution of the vote share. Also, this Anova table was included in the paper by attending a request of one of our reviewers. However, if you analyze the means of two groups by Anova, you're essentially getting the same results as doing it with a t test.

- The output includes the columns F value and Pr(>F) corresponding to the p-value of the test. 

- The output of Tukey multiple comparisons of means (TukeyHSD) includes the difference between means of the two groups and the interval. 

```{r}
aov0 <- aov(vshare ~ as.factor(feminino), data = bd)

summary(aov0, intercept=TRUE)

TukeyHSD(aov0, conf.level=.95)
```

- From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.


```{r}
# 1994
# t.test(vshare ~ as.factor(feminino), data = subset(bd, year==1))
aov1 <- aov(vshare ~ as.factor(feminino), na.action=na.omit, data = subset(bd, year==1))

summary(aov1, intercept=TRUE)

TukeyHSD(aov1, conf.level=.95)

# The alternative procedure Welch one-way test does not require the assumption of equal variances
oneway.test(vshare ~ as.factor(feminino), data = subset(bd, year==1), na.action=na.omit)
```


```{r}
# 1998
# t.test(vshare ~ as.factor(feminino), data = subset(bd, year==2))
aov2 <- aov(vshare ~ as.factor(feminino), na.action=na.omit, data = subset(bd, year==2))

summary(aov2, intercept=TRUE)

TukeyHSD(aov2, conf.level=.95)
```


```{r}
# 2002
# t.test(vshare ~ as.factor(feminino), data = subset(bd, year==3))
aov3 <- aov(vshare ~ as.factor(feminino), na.action=na.omit, data = subset(bd, year==3))
summary(aov3, intercept=TRUE)

TukeyHSD(aov3, conf.level=.95)
```


```{r}
# 2006
# t.test(vshare ~ as.factor(feminino), data = subset(bd, year==4))
aov4 <- aov(vshare ~ as.factor(feminino), na.action=na.omit, data = subset(bd, year==4))
summary(aov4, intercept=TRUE)

TukeyHSD(aov4, conf.level=.95)
```



```{r}
# 2010
# t.test(vshare ~ as.factor(feminino), data = subset(bd, year==5))
aov5 <- aov(vshare ~ as.factor(feminino), na.action=na.omit, data = subset(bd, year==5))
summary(aov5, intercept=TRUE)

TukeyHSD(aov5, conf.level=.95)
```


```{r}
# 2014
# t.test(vshare ~ as.factor(feminino), data = subset(bd, year==6))
aov6 <- aov(vshare ~ as.factor(feminino), na.action=na.omit, data = subset(bd, year==6))
summary(aov6, intercept=TRUE)

TukeyHSD(aov6, conf.level=.95)
```


## Table A2 Bonferroni Tests Evaluating Differences in Laranjas Over Time (1998–2014)

```{r}

```


# Multilevel mixed effects logistic model

- Multilevel mixed effects logistic regression model for laranja.
- We've used xtmelogit in Stata and the glmer in R to fit our generalized linear mixed-effects model, presented in Table 7.

```{r, eval=FALSE, include=FALSE}
xtmelogit laranja i.feminino##i.left i.feminino##c.pmag i.feminino##c.dmag i.hdigroup c.eff_ncands i.year##i.feminino incmbt fd_ocup2 educ cf cfsq || uf_id: feminino, laplace or var cov(un) 

# The following Stata model meets exactly the R version model 11 in this file:

xtmelogit laranja i.feminino##i.left i.feminino##c.pmag i.feminino##c.dmag i.hdigroup c.eff_ncands i.year##i.feminino incmbt fd_ocup2 educ || uf_id: feminino, laplace or var cov(un)
```


- As presented in the paper, our theoretical model can be defined as a varying intercept and slope (feminino) model. We estimated this using the Laplace approximation setup. In R, by default the parameters `n` for `nAGQ` is `1`, what corresponds to Laplace approximation. But one should keep in mind the greater the `n`, the greater the accuracy in the evaluation of the log-likelihood.

#### The following is a custom function for computing intra-class correlation {-}
- We've added a small custom function for computing intra-class correlation (ICC) of the model. If ICC is greater than 0, it suggests we're correct to think of our data to have a multilevel variance structure.

```{r}
model_icc <- function(model) {
  tau.null <- as.numeric(lapply(summary(model)$varcor, diag))
  sigma.null <- as.numeric(attr(summary(model)$varcor, "sc") ^ 2)
  ICC.null <- tau.null / (tau.null + sigma.null)
  return(ICC.null)
}
```


#### This sets some variables as factor and add labels to them {-}

```{r}
bd <- bd %>%
  mutate(
    laranja = factor(laranja, level = 0:1, labels = c("no", "yes")),
    feminino = factor(
      feminino,
      level = 0:1,
      labels = c("male", "female")
    ),
    incmbt = factor(
      incmbt,
      level = 0:1,
      labels = c("nonincumdent", "incumbent")
    ),
    left = factor(
      left,
      level = 0:1,
      labels = c("non-left", "left")
    ),
    educ =  factor(
      educ,
      level = 1:3,
      labels = c("primary", "high school", "college")
    ),
    year =  factor(
      year,
      level = 1:6,
      labels = c("1994", "1998", "2002", "2006", "2010", "2014")
    ),
    hdigroup = factor(
      hdigroup,
      level = 0:2,
      labels = c("low", "intermediate", "high")
    ),
    fd_ocup2 = factor(
      fd_ocup2,
      level = 0:1,
      labels = c("others", "feeder")
    )
  )
```


#### The null model {-}
```{r}
summary(m0 <- glmer(laranja ~ 1 + # This simply means laranja is predicted by the intercept
                    (1|uf_id), # each uf or state unit gets its own intercept
                    family = binomial("logit"),
                    data = bd,
                    control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

model_icc(m0)
```


#### Varying intercept model {-}
```{r}
summary(m1a <- glmer(laranja ~ feminino + # This means laranja is predicted by the intercept and the feminino predictor
                     (1|uf_id), # each uf or state unit gets its own intercept
                     family = binomial("logit"),
                     data = bd, control = glmerControl(optimizer = "bobyqa"),
                     nAGQ=1))

model_icc(m1a)
```


#### Varying intercept model with feminino allowed to vary as a function of the electoral districts {-}
```{r}
summary(m1b <- glmer(laranja ~ feminino +
                       (1 + feminino|uf_id), # each uf_id gets its own intercept, and feminino can vary as a function of the uf_id
                     family = binomial("logit"),
                     data = bd, control = glmerControl(optimizer = "bobyqa"),
                     nAGQ=1))
```



```{r, eval=FALSE, include=FALSE}

summary(m2 <- glmer(laranja ~ feminino * left + year * feminino +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

summary(m3 <- glmer(laranja ~ feminino * left + feminino * pmag + year * feminino +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

summary(m4 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + year * feminino +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

summary(m5 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + hdigroup + year * feminino +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

summary(m6 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + hdigroup + eff_ncands + year * feminino +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

summary(m7 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + hdigroup + eff_ncands + year * feminino + incmbt +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

summary(m8 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + hdigroup + eff_ncands + year * feminino + incmbt + fd_ocup2 +
                  (feminino|uf_id),
                  family = binomial("logit"),
                  data = bd, control = glmerControl(optimizer = "bobyqa"),
                  nAGQ=1))

summary(m9 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + hdigroup + eff_ncands + year * feminino + incmbt + fd_ocup2 + educ +
                    (feminino|uf_id),
                    family = binomial("logit"),
                    data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))

## with campaign finance - it does not work without rescaling cf
summary(m10 <- glmer(laranja ~ feminino * left + feminino * pmag + feminino * dmag + hdigroup + eff_ncands + year * feminino + incmbt + fd_ocup2 + educ + cf + cfsq +
                     (feminino|uf_id),
                     family = binomial("logit"),
                     data = bd, control = glmerControl(optimizer = "bobyqa"),
                     nAGQ=1))
```


## Table 7 Characteristics predicting laranja status (1998–2014)

#### The final model {-}
```{r}
summary(m11 <- glmer(laranja ~ feminino * left + feminino * pmag + hdigroup + eff_ncands + year * feminino + incmbt + fd_ocup2 + educ +
                    (feminino|uf_id),
                    family = binomial("logit"),
                     data = bd, control = glmerControl(optimizer = "bobyqa"),
                    nAGQ=1))
```


#### To report main effects and interactions instead of slopes we can convert mixed models into analysis of variance {-}
```{r}
anova(m11, ddf = "Kenward-Roger") 

# Kenny et al suggested a method that works fine: Westfall, J., Kenny, D. A.,
# & Judd, C. M. (2014). Statistical power and optimal design in experiments 
# in which samples of participants respond to samples of stimuli. 
# Journal of Experimental Psychology: General, 143(5), 2020-2045.

# anova(m11, ddf = "Satterthwaite")  # also works fine
```


#### This computes the confidence intervals (CIs) for the fixed effects estimates with exponentiated values {-}

- We can get rough estimates using the SEs, but a more rigorous approach using simulations and bootstrapping is also included next.

```{r}
se <- sqrt(diag(vcov(m11)))
# table of estimates with 95% CI
tab_m11 <- cbind(Est = fixef(m11), LL = fixef(m11) - 1.96 * se, UL = fixef(m11) + 1.96 * se)

# instead of coefficients on the logit scale, we exponentiate the estimates and CIs.
round(exp(tab_m11),3)
```


#### Compute predictive probabilities of laranja {-}
```{r}
library(ggeffects)

p <- ggpredict(m11, term = c("year", "feminino"))

plot(p)
```


```{r}
# computed values 
print(p)
```



## Simulations
- This is for robustness checks as we didn't detailed this approach in the paper

```{r}
## For single level models, we can implement a simple random sample with replacement for bootstrapping.
## With multilevel data, we want to resample in the same way as the data generating mechanism.
## We start by resampling from the highest level (states), and then stepping down one level at a time.
## In our case, we first sample from sates (uf_id), and then within each uf unit sampled, we will sample from their candidates.

sampler <- function(dat, clustervar, replace = TRUE, reps = 1) {
  cid <- unique(dat[, clustervar[1]])
  ncid <- length(cid)
  recid <- sample(cid, size = ncid * reps, replace = TRUE)
  if (replace) {
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = sample(which(dat[, clustervar] == recid[i]),
                                      size = length(which(dat[, clustervar] == recid[i])), replace = TRUE))
    })
  } else {
    rid <- lapply(seq_along(recid), function(i) {
      cbind(NewID = i, RowID = which(dat[, clustervar] == recid[i]))
    })
  }
  dat <- as.data.frame(do.call(rbind, rid))
  dat$Replicate <- factor(cut(dat$NewID, breaks = c(1, ncid * 1:reps), include.lowest = TRUE,
                              labels = FALSE))
  dat$NewID <- factor(dat$NewID)
  return(dat)
}


m11_boot <- function(i) {
  object <- try(glmer(laranja ~ feminino * left + feminino * pmag + hdigroup + eff_ncands +
                      year * feminino + incmbt + fd_ocup2 + educ +
                      (feminino | uf_id), data = bigdata, subset = Replicate == i,
                      family = binomial("logit"),
                      nAGQ = 1, start = list(fixef = f, theta = r)), silent = TRUE)
  if (class(object) == "try-error")
    return(object)
  c(fixef(object), getME(object, "theta"))
}

```



```{r, eval= FALSE}
set.seed(20)
tmp <- sampler(bd, "uf_id", reps = 100)
bigdata <- cbind(tmp, bd[tmp$RowID, ])


f <- fixef(m11)
r <- getME(m11, "theta")

cl <- makeCluster(4)
clusterExport(cl, c("bigdata", "f", "r"))
clusterEvalQ(cl, require(lme4))


start <- proc.time()
res <- parLapplyLB(cl, X = levels(bigdata$Replicate), fun = m11_boot)
end <- proc.time()
start - end

# shut down the cluster
stopCluster(cl)

# calculate proportion of models that successfully converged
success <- sapply(res, is.numeric)
mean(success)


# combine successful results
bigres <- do.call(cbind, res[success])

# calculate 2.5th and 97.5th percentiles for 95% CI
(ci <- t(apply(bigres, 1, quantile, probs = c(0.025, 0.975))))


# All results
finaltable <- cbind(Est = c(f, r),
                    SE = c(se, NA), BootMean = rowMeans(bigres), ci)

# round, exponentiate and print
round(exp(finaltable),3)
```
